Python notebook using data from multiple data sources · 1,904 views · 1mo ago · starter code, beginner, data visualization, eda, tutorial
Input
Data Sources
Titanic: Machine Learning from Disaster
titanic leaked
Titanic: Machine Learning from Disaster
Last Updated: 8 years ago
Overview
The data has been split into two groups:
- training set (train.csv)
- test set (test.csv)
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
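A minimal sketch of how such a gender-based baseline could be produced. The tiny DataFrame is a hypothetical stand-in for test.csv, and the column names `PassengerId`, `Sex`, and `Survived` are assumed to match the competition CSV files; the output file name is illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for test.csv; in practice: test = pd.read_csv("test.csv")
test = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895],
    "Sex": ["male", "female", "male", "female"],
})

# Baseline rule: all and only female passengers are predicted to survive.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": (test["Sex"] == "female").astype(int),
})

print(submission)
# To write a submission file (file name is an assumption):
# submission.to_csv("gender_baseline.csv", index=False)
```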
Data Dictionary
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
Output
submission1.csv
submission2.csv
Thank you for this very informative and humorous kernel. Upvoted 👍
thanks a lot @tomzomac
Upvoted!!
thanks a lot @nirajpoudel
Great kernel :)
thanks @krutarthhd
Thank you very much for this! Lots of great insights in terms of Exploratory Data Analysis. If I may (although you probably already know this one, since you mention the tendency of RandomForest not to overfit the training set): you use the entire training set to train your models (which is fine), but you compare models using a "score" measured on the same data used in training. Wouldn't it be preferable to use unseen data (or, at least, cross-validation) in order to know how the models will behave on the test set?
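The difference raised in this comment can be sketched as follows. This is an illustration on synthetic data generated with `make_classification` (an assumption, not the Titanic features), and the model settings are arbitrary; it is not the notebook's own evaluation code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Titanic features (an assumption, not the real data).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# Score on the same rows used for fitting: tends to be optimistic.
train_score = model.fit(X, y).score(X, y)

# 5-fold cross-validation: each fold is scored on rows the model did not see.
cv_scores = cross_val_score(model, X, y, cv=5)

print(f"training accuracy:        {train_score:.3f}")
print(f"cross-validated accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

Comparing models by the cross-validated figure rather than the training figure gives a less biased idea of how each would behave on the test set.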
True, but we have to select only one model, right? Scoring on the training data told me which model works on this dataset. In other cases, when using different models, it's better to ensemble them, although that isn't really proper learning, just a way to climb the leaderboard. In this task you can also run all of these models, collect their predictions, and then use them as votes for each ID.
For example, if a particular ID gets more 1s than 0s, you select 1 as the final answer.
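A minimal sketch of the per-ID majority vote described here. The three prediction rows are made-up values for hypothetical models A, B, and C; using an odd number of models avoids ties.

```python
import numpy as np

# Hypothetical 0/1 predictions for five passenger IDs, one row per model.
predictions = np.array([
    [1, 0, 1, 0, 1],  # model A
    [1, 0, 0, 0, 1],  # model B
    [0, 0, 1, 1, 1],  # model C
])

# Count the votes for class 1 per passenger (column-wise sum).
votes_for_1 = predictions.sum(axis=0)

# Pick 1 when strictly more than half of the models voted 1, else 0.
final = (votes_for_1 > predictions.shape[0] / 2).astype(int)

print(final.tolist())  # [1, 0, 1, 0, 1]
```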
Thank you, those are good insights on how to easily use ensemble methods (I hadn't thought of the voting system).
Regarding the question from earlier, would it be preferable to split the original training set into a train/validation set and then compare how each model performs on the validation set (and even use the validation set as early stopping in an ANN), OR use the entire training set to benefit from the entire data (because it is quite a small data set to begin with)?
@maxencelemercier sorry for the late reply. Actually, the score is calculated by that process. As you can see, all models are trained on the whole dataset, but the score is calculated by splitting the dataset 80:20. If you want to change that, you can use sklearn.model_selection.train_test_split. You can also build the validation matrix to find the proper accuracy. It will be introduced in the next version of this notebook.
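A sketch of an 80:20 hold-out evaluation with `train_test_split`, in which the model is fitted only on the 80% portion and scored on the held-out 20%. The data is synthetic and the choice of `LogisticRegression` is an assumption for illustration; this is not the notebook's actual scoring code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic training set (an assumption).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# 80% of the rows for fitting, 20% held out for validation.
# stratify=y keeps the class balance similar in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit on the training portion only, then score on rows the model never saw.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_accuracy = model.score(X_val, y_val)

print(f"validation accuracy: {val_accuracy:.3f}")
```

On a dataset as small as this one, a single split can be noisy, so repeating the split or using cross-validation is a common alternative.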
Good kernel Natsu, knowledge and humor.
Thanks @ravels1991
Very good Notebook, charts and Dataset. Great sense of humor.
Thanks man!
Nice work:)
Thanks @granjithkumar
Excellent work! Thank you for the humorous memes. They were very funny 😀
I just started doing ML 45 days ago. Am I supposed to know all this coding? I can understand all of it, but when I compare my model with yours, yours is far more complex: you imputed missing values with great care, whereas I just imputed the mode. My model's univariate and bivariate analysis also sucks.
Would you mind referring me to a course I can join? I have already completed Udemy's Machine Learning A-Z course.
thanks
Great work
Amazing :)
Thanks for the share
@soham1024 hi!👋
Thank you for inspiring research and solution, it was pleasure to read and apply it in my work!👍
Have you considered features such as 'Lucky ticket' (some ticket numbers are luckier than others) and 'Cabin location' in your following entries?
Thank you!
Hello, I drew the pointplot for Embarked. You showed that for Embarked C, the males' survival chance (i.e. percentage) is higher. I looked at the data and found that this is not true. Please check the image. If I am wrong, correct me. Nice code.

I got plot as shown below:
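One way to check a claim like this numerically, independent of any plot, is to tabulate the survival rate per port and sex. The sketch below uses a tiny made-up DataFrame, so the numbers say nothing about the real data; the column names `Embarked`, `Sex`, and `Survived` are assumed to match train.csv.

```python
import pandas as pd

# Made-up rows for illustration; in practice: df = pd.read_csv("train.csv")
df = pd.DataFrame({
    "Embarked": ["C", "C", "C", "C", "S", "S", "S", "S"],
    "Sex": ["male", "male", "female", "female",
            "male", "male", "female", "female"],
    "Survived": [1, 0, 1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the survival rate; count shows the group size.
rates = df.groupby(["Embarked", "Sex"])["Survived"].agg(["mean", "count"])

print(rates)
```

The `count` column matters when reading such a table, since a rate computed from a handful of passengers is much less reliable than one from a large group.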

Hey Natsu! How's it going with Lucy?? Nice meme kernel, by the way!
Thanks @danoozy44
